1. Introduction & Web Scraping

In this project I analyse data on the best movies from the website Metacritic (https://www.metacritic.com/browse/movies/score/metascore/all/filtered?sort=desc). The goal is to scrape several characteristics of each movie: year of release, director, distributor, runtime, country, and metascore. I then run a simple EDA to understand the data better and, finally, study which factors are associated with the runtime of the movies.

Data from initial page

library(rvest)
library(dplyr)

url <- "https://www.metacritic.com/browse/movies/genre/metascore/thriller?view=detailed"
page <- read_html(url)

title <- page %>%
  html_nodes(".title h3") %>%
  html_text() 

metascore <- page %>%
  html_nodes(".clamp-score-wrap .positive") %>%
  html_text()

date <- page %>%
  html_nodes(".clamp-details span:nth-child(1)")%>%
  html_text()
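
The selectors above depend on Metacritic's markup at the time of scraping, which may have changed since. As a minimal offline sketch, the same extraction pattern can be tried on a small hand-written HTML snippet. The markup and the names demo_page and demo_df are invented for illustration; only the CSS selectors are taken from the code above.

```r
library(rvest)

# Hypothetical markup written for this example; the real page may differ.
demo_page <- minimal_html('
  <div class="clamp-summary-wrap">
    <div class="title"><h3>Rear Window</h3></div>
    <div class="clamp-details"><span>September 1, 1954</span><span>PG</span></div>
    <div class="clamp-score-wrap"><div class="positive">100</div></div>
  </div>
  <div class="clamp-summary-wrap">
    <div class="title"><h3>Vertigo</h3></div>
    <div class="clamp-details"><span>May 28, 1958</span><span>PG</span></div>
    <div class="clamp-score-wrap"><div class="positive">100</div></div>
  </div>')

demo_title <- demo_page %>% html_nodes(".title h3") %>% html_text(trim = TRUE)
demo_date <- demo_page %>%
  html_nodes(".clamp-details span:nth-child(1)") %>%
  html_text(trim = TRUE)
demo_score <- demo_page %>%
  html_nodes(".clamp-score-wrap .positive") %>%
  html_text(trim = TRUE)

# The vectors must have equal length before they are combined into one table
stopifnot(length(demo_title) == length(demo_date),
          length(demo_title) == length(demo_score))

demo_df <- data.frame(title = demo_title, date = demo_date,
                      metascore = demo_score, stringsAsFactors = FALSE)
```

The length check matters in practice: if one selector misses an entry, the columns would otherwise be misaligned without any error.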

2. Data preprocessing

glimpse(movies_df)
## Rows: 100
## Columns: 7
## $ title       <chr> "The Godfather", "Rear Window", "Vertigo", "Notorious", "T…
## $ date        <chr> "March 24, 1972", "September 1, 1954", "May 28, 1958", "Se…
## $ metascore   <chr> "100", "100", "100", "100", "99", "98", "98", "98", "98", …
## $ distributor <chr> "Paramount Pictures", "Paramount Pictures", "Paramount Pic…
## $ director    <chr> "Francis Ford Coppola", "Alfred Hitchcock", "Alfred Hitchc…
## $ country     <chr> "USA", "US", "US", "US", "US", "GB", "USA,Spain,Mexico", "…
## $ runtime     <named list> "175 min", "112 min", "128 min", "101 min", "95 min…

Everything looks good except for date and runtime: for the analysis I need only the release year rather than the full date, and the runtime as a number without the "min" suffix.

Let’s extract the year from date and save it in a new column:

movies_df$year <- sub(".*,", "", movies_df$date)

Let’s strip the "min" suffix from runtime and save the result in a new column:

movies_df$time <- sub("min", "", movies_df$runtime)
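
A small check on made-up strings in the same format shows what these two sub() calls return and how the results convert to numbers. The vectors demo_date and demo_runtime are assumptions for illustration, not part of the scraped data.

```r
demo_date <- c("March 24, 1972", "September 1, 1954")
demo_runtime <- c("175 min", "112 min")

# Drop everything up to and including the last comma; a leading space remains
demo_year <- sub(".*,", "", demo_date)
# Remove the "min" suffix; a trailing space remains
demo_time <- sub("min", "", demo_runtime)

# as.numeric() ignores the surrounding whitespace
year_num <- as.numeric(demo_year)
time_num <- as.numeric(demo_time)
```

Wrapping the sub() calls in trimws() would remove the leftover spaces if the columns were to be kept as character.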

However, there is also a small problem with the country variable, because some movies list several countries. I keep only the first one:

movies_df$country_1 <- sub(",.*", "", movies_df$country)

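The director variable has the same problem, so I again keep only the first name listed. After that, the original columns are no longer needed and can be dropped: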

movies_df$director_1 <- sub(",.*", "", movies_df$director)
movies_df <- movies_df %>%
  select(- runtime, - date, - director, - country)

There is another problem with the country variable: the same country is written in different ways (e.g., US and USA), so I recode such values to a single label:

movies_df <- movies_df %>%
  mutate(country_1 = recode(country_1,
                            "US" = "USA",
                            "GB" = "UK",
                            "DE" = "Germany",
                            "JP" = "Japan",
                            "Hong Kong" = "China"))

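Table 2 below still lists a few codes (AU, FR, IR, IT, KR, SUHH) next to full country names, so the mapping could be extended. The sketch below uses a named lookup vector instead of recode(). The labels are my assumption based on the standard ISO 3166 codes, and demo_country is made-up input.

```r
# Assumed mapping from ISO 3166 codes to full labels (not taken from the scraped data)
country_lookup <- c(US = "USA", GB = "UK", DE = "Germany", JP = "Japan",
                    AU = "Australia", FR = "France", IR = "Iran", IT = "Italy",
                    KR = "South Korea", SUHH = "Soviet Union")

demo_country <- c("US", "USA", "FR", "France", "SUHH", "India")

# Replace known codes and leave every other value unchanged
is_code <- demo_country %in% names(country_lookup)
demo_clean <- demo_country
demo_clean[is_code] <- unname(country_lookup[demo_country[is_code]])
```

Applying such a lookup to movies_df would change the category counts reported later, so the tables would need to be regenerated.
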
Converting variables to the correct types

Numeric variables

movies_df$metascore <- as.numeric(movies_df$metascore)
movies_df$year <- as.numeric(movies_df$year)
movies_df$time <- as.numeric(movies_df$time)

Factor variables

movies_df$country_1 <- as.factor(movies_df$country_1)
movies_df$distributor <- as.factor(movies_df$distributor)
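
as.numeric() turns any value it cannot parse into NA and only issues a warning, so it is worth counting missing values after the conversion. A minimal sketch on made-up input (demo_raw is hypothetical, and its third element stands in for an entry that could not be parsed):

```r
demo_raw <- c("175 ", "112 ", "NULL")

# Unparseable strings become NA; the warning is suppressed here on purpose
demo_num <- suppressWarnings(as.numeric(demo_raw))

n_missing <- sum(is.na(demo_num))       # number of missing values
share_missing <- mean(is.na(demo_num))  # proportion of missing values
```

On the real data, the same two lines applied to movies_df$time would show how many runtimes are missing.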

Now the data is cleaned and ready for analysis.

glimpse(movies_df)
## Rows: 100
## Columns: 7
## $ title       <chr> "The Godfather", "Rear Window", "Vertigo", "Notorious", "T…
## $ metascore   <dbl> 100, 100, 100, 100, 99, 98, 98, 98, 98, 97, 97, 97, 97, 97…
## $ distributor <fct> "Paramount Pictures", "Paramount Pictures", "Paramount Pic…
## $ year        <dbl> 1972, 1954, 1958, 1946, 1958, 1938, 2006, 1959, 2002, 1955…
## $ time        <dbl> 175, 112, 128, 101, 95, 96, 118, 136, 153, 92, 104, 95, 12…
## $ country_1   <fct> USA, USA, USA, USA, USA, UK, USA, USA, Germany, USA, UK, U…
## $ director_1  <chr> "Francis Ford Coppola", "Alfred Hitchcock", "Alfred Hitchc…

3. Exploratory data analysis

Basic statistics

var_names <- movies_df %>%
  rename(`Year of release` = year,
         `Runtime of movie` = time,
         `Metascore` = metascore,
         `Country` = country_1,
         `Distributor` = distributor)

var_names <- var_names %>%
  select(-title, -director_1)

caption_1 <- "Table 1. Sample descriptive statistics for continuous variables"

library(modelsummary)
datasummary_skim(var_names, title = caption_1)
Table 1. Sample descriptive statistics for continuous variables
                  Unique (#)  Missing (%)  Mean    SD    Min     Median  Max
Metascore         15          0            91.4    4.0   86.0    90.5    100.0
Year of release   61          0            1985.4  27.1  1926.0  1988.0  2023.0
Runtime of movie  65          2            119.8   33.2  72.0    113.0   325.0

caption_2 <- "Table 2. Sample descriptive statistics for categorical variables"
datasummary_skim(var_names, type = "categorical", title = caption_2)
Table 2. Sample descriptive statistics for categorical variables
                                                       N   %
Distributor  A24                                       3   3.0
             ARRAY Releasing                           1   1.0
             British Lion Film Corporation             2   2.0
             Bryanston Distributing                    1   1.0
             Cinelicious Pics                          1   1.0
             Cineriz                                   1   1.0
             CJ Entertainment                          1   1.0
             Columbia Pictures                         6   6.0
             Compass International Pictures            1   1.0
             Filmways Pictures                         1   1.0
             Fine Line Features                        1   1.0
             Fox Searchlight Pictures                  1   1.0
             Gaumont British Distributors              1   1.0
             Geffen Company, The                       1   1.0
             Goskino                                   1   1.0
             Gramercy Pictures (I)                     1   1.0
             Grasshopper Film                          1   1.0
             Home Box Office (HBO)                     1   1.0
             IFC Films                                 1   1.0
             Janus Film                                1   1.0
             Lopert Pictures Corporation               1   1.0
             Metro-Goldwyn-Mayer (MGM)                 5   5.0
             Miramax                                   1   1.0
             Miramax Films                             3   3.0
             Motion Picture Export Association (MPEA)  1   1.0
             Neon                                      1   1.0
             Netflix                                   1   1.0
             Newmarket Films                           1   1.0
             Open Road Films (II)                      1   1.0
             Paramount Pictures                        11  11.0
             Paramount Vantage                         1   1.0
             Picturehouse                              1   1.0
             Pierre Grise Distribution                 1   1.0
             Rialto Pictures                           2   2.0
             RKO Radio Pictures                        1   1.0
             Roadside Attractions                      1   1.0
             Royal Films International                 1   1.0
             Samuel Goldwyn Films                      1   1.0
             Selznick Releasing Organization           1   1.0
             Sony Pictures Classics                    3   3.0
             Summit Entertainment                      1   1.0
             The Cinema Guild                          2   2.0
             Times Film Corporation                    1   1.0
             Toho Company                              2   2.0
             Turtle Releasing                          1   1.0
             Twentieth Century Fox Film Corporation    2   2.0
             United Artists                            9   9.0
             Universal Pictures                        6   6.0
             Warner Bros.                              6   6.0
             Warner Bros. Pictures                     3   3.0
Country      AU                                        1   1.0
             FR                                        1   1.0
             France                                    4   4.0
             Germany                                   9   9.0
             India                                     1   1.0
             IR                                        1   1.0
             IT                                        1   1.0
             Japan                                     3   3.0
             KR                                        1   1.0
             Spain                                     1   1.0
             SUHH                                      1   1.0
             UK                                        11  11.0
             USA                                       65  65.0

Visualization with plotly

library(plotly)
plot_ly(movies_df, x = ~time, y = ~metascore, type = 'scatter', mode = 'markers') %>%
  layout(title = 'Correlation between time and metascore',
         xaxis = list(title = 'Runtime'),
         yaxis = list(title = 'Metascore'))

plot_ly(movies_df, x = ~year, y = ~metascore, type = 'scatter', mode = 'markers') %>%
  layout(title = 'Correlation between year of release and metascore',
         xaxis = list(title = 'Year'),
         yaxis = list(title = 'Metascore'))
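
The scatter plots show the relationships visually; a correlation coefficient puts a number on them. A minimal sketch with base R's cor(), using made-up values (demo_df is hypothetical) and use = "complete.obs" so that rows with a missing runtime are skipped:

```r
# Invented values for illustration only, including one missing runtime
demo_df <- data.frame(
  time = c(175, 112, 128, NA, 95),
  metascore = c(100, 100, 99, 98, 97)
)

# Pearson correlation computed on the complete rows only
r_time_score <- cor(demo_df$time, demo_df$metascore, use = "complete.obs")
```

With the real data, the same call on movies_df$time and movies_df$metascore would give the coefficient that corresponds to the first plot.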

4. Linear regression

What factors influence runtime of the movies?

To be able to use the categorical variable country, I need to reduce its number of categories. I decided to create a binary variable that indicates whether the film's country is the USA or not.

movies_df_2 <- movies_df %>%
  mutate(country_binary = ifelse(country_1 %in% c('USA'), 'USA', 'not_USA'))
library(sjPlot)

labs = c("Constant", "Year of release", 
         "Meta score",
         "Country (USA)")

model <- lm(time ~ year + metascore + country_binary, data = movies_df_2)


tab_model(model, pred.labels = labs, title = "Table 3. Linear regression: Factors that influence runtime of the best movies of all time",
          dv.labels = "Runtime")
Table 3. Linear regression: Factors that influence runtime of the best movies of all time
                  Runtime
Predictors        Estimates  CI                  p
Constant          -812.05    -1347.24 – -276.86  0.003
Year of release   0.40       0.16 – 0.64         0.001
Meta score        1.51       -0.13 – 3.14        0.070
Country (USA)     -0.07      -13.64 – 13.49      0.991
Observations      98
R2 / R2 adjusted  0.115 / 0.087
  • Each additional year in the release date is associated with an average increase of 0.4 minutes in runtime, holding everything else constant (p-value = 0.001).

  • The other variables are not statistically significant predictors of runtime at the 5% level (metascore: p = 0.070; country: p = 0.991).

  • R-squared equals 0.115 (adjusted R-squared is 0.087), which means that only about 12% of the variance in runtime is explained by the model. Thus, I can conclude that its explanatory power is weak.
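
Two details of the model can be checked directly in R: which category of country_binary serves as the reference level, and how R-squared differs from adjusted R-squared. The sketch below uses simulated data (demo_df and all of its values are invented), sets the reference level explicitly with factor(), and reads both statistics from summary().

```r
set.seed(1)
n <- 40

# Simulated stand-in for the movie data; none of these values are real
demo_df <- data.frame(
  year = sample(1930:2020, n, replace = TRUE),
  metascore = sample(86:100, n, replace = TRUE),
  country_binary = rep(c("USA", "not_USA"), length.out = n)
)
demo_df$time <- 100 + 0.3 * (demo_df$year - 1930) + rnorm(n, sd = 20)

# Make "not_USA" the reference level, so the coefficient describes USA films
demo_df$country_binary <- factor(demo_df$country_binary,
                                 levels = c("not_USA", "USA"))

demo_model <- lm(time ~ year + metascore + country_binary, data = demo_df)
fit <- summary(demo_model)

r2 <- fit$r.squared          # share of variance explained
adj_r2 <- fit$adj.r.squared  # the same, penalised for the number of predictors
coef_names <- names(coef(demo_model))
```

Without the explicit factor() call, the reference level of a character variable is the first value in sort order, which can depend on the locale, so fixing it makes a label such as "Country (USA)" unambiguous.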